One of the key responsibilities of any IT function is to "keep the lights on" and ensure there is no impact to business operations. IT uses the Incident Management process to achieve this objective. An incident is an unplanned interruption to an IT service, or a reduction in the quality of an IT service, that affects users and the business. The main goal of the Incident Management process is to provide a quick fix, workaround, or solution that resolves the interruption and restores the service to full capacity with no business impact. In most organizations, incidents are created by business and IT users, by end users and vendors (if they have access to the ticketing systems), and by integrated monitoring systems and tools. Assigning incidents to the appropriate person or unit in the support team is critical for user satisfaction and for effective allocation of support resources. In many IT organizations this assignment is still a manual process. Manual assignment is time-consuming, requires human effort, and is prone to human error; misrouted tickets waste support resources and increase response and resolution times, which leads to poor customer service and deteriorating user satisfaction.
In the support process, incoming incidents are analyzed and assessed by the organization's support teams. In many organizations, better allocation and more effective use of valuable support resources translate directly into substantial cost savings. Currently, incidents are created by various stakeholders (business users, IT users, and monitoring tools) in the IT Service Management tool and are assigned to Service Desk teams (L1/L2). These teams review incidents for correct categorization and priority, then carry out an initial diagnosis to see whether they can resolve them; around ~54% of incidents are resolved by the L1/L2 teams. If L1/L2 cannot resolve an incident, they escalate it to the functional application and infrastructure teams (L3). Some incidents are assigned directly to L3 teams by monitoring tools or by the callers/requestors themselves. L3 teams carry out detailed diagnosis and resolve the incidents; around ~56% of the incidents they receive are resolved by the functional/L3 teams. If vendor support is needed, L3 engages the vendor to drive the incident to closure. Before assigning tickets to the functional teams, L1/L2 must review Standard Operating Procedures (SOPs): a minimum of ~25-30% of incidents require an SOP review, at roughly 15 minutes per incident, which amounts to a minimum of ~1 FTE spent solely on incident assignment to L3 teams.
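As a rough illustration of the effort figures above, the SOP-review cost can be turned into an FTE estimate. The monthly incident volume below (1000) is an assumed, illustrative number; the review share (~25-30%) and the 15 minutes per review come from the text.

```python
# Back-of-the-envelope estimate of SOP-review effort before ticket assignment.
# incidents_per_month = 1000 is an assumed, illustrative volume.

def sop_review_hours(incidents_per_month, review_share=0.275, minutes_per_review=15):
    """Hours per month spent on SOP review before assigning tickets to L3."""
    reviewed = incidents_per_month * review_share
    return reviewed * minutes_per_review / 60

hours = sop_review_hours(1000)   # assumed monthly volume
fte = hours / 160                # ~160 working hours per month
print(f"{hours:.1f} hours/month ~= {fte:.2f} FTE")
```

With these assumptions, roughly 69 hours a month (about 0.4 FTE) goes into SOP review alone, consistent with the "~1 FTE for assignment" figure once the other assignment activities are included.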
While L1/L2 teams assign incidents to functional groups, incidents are often routed to the wrong group: around ~25% of incidents are assigned to the wrong functional team, and the functional teams must spend additional effort re-assigning them to the right group. While this happens, some incidents sit in a queue and are not addressed in time, resulting in poor customer service.
AI techniques that can classify incidents to the right functional group can help organizations reduce issue resolution time and let support staff focus on more productive tasks.
# Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')
Details about the data and dataset files are given in below link, https://drive.google.com/open?id=1OZNJm81JXucV3HmZroMq6qCT2m7ez7IJ
Pre-Processing, Data Visualization and EDA:
Model Building:
Test the Model, Fine-tuning and Repeat:
In this capstone project, the goal is to build a classifier that can classify the tickets by analyzing text.
The objective of the project is,
# Setting the current working directory
import os; os.chdir('/content/drive/My Drive/AIML/CapstoneProject')
#Install libraries which are not available
!pip install preprocessing
!pip install googletrans
!pip install langdetect
!pip install -U gensim
!pip install ftfy wordcloud goslate spacy plotly cufflinks nltk rake-nltk fasttext pyLDAvis
The dataset is an Excel file; we use the pandas library to load it into a DataFrame.
%tensorflow_version 2.x
import tensorflow
tensorflow.__version__
# Import packages
import warnings
warnings.filterwarnings('ignore')
import pandas as pd, numpy as np, tensorflow as tf
import matplotlib.pyplot as plt, seaborn as sns
import matplotlib.style as style
from sklearn import preprocessing
from wordcloud import WordCloud, STOPWORDS
stopwords = set(STOPWORDS)
import random, re
assert tf.__version__ >= '2.0'
%matplotlib inline
from preprocessing import *
from itertools import islice
# Models
from tensorflow.keras.layers import Dense, LSTM, Embedding, Dropout, Flatten, Bidirectional, GlobalMaxPool1D
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, TensorBoard
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.initializers import Constant
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
# Set random state
random_state = 42
np.random.seed(random_state)
tf.random.set_seed(random_state)
# Loading Dataset
tickets_df = pd.read_excel('input_data.xlsx')
print(f'Data has {tickets_df.shape[0]} rows and {tickets_df.shape[1]} columns. Here are the first five rows of the data...')
print("\n")
display(tickets_df.head())
print("\n")
print("\nHere are the last five rows of the data...\n")
display(tickets_df.tail())
In any Machine Learning process, Data Preprocessing is that step in which the data gets transformed, or Encoded, to bring it to such a state that now the machine can easily parse it. In other words, the features of the data can now be easily interpreted by the algorithm.
Feature engineering: The process of creating new features from raw data to increase the predictive power of the learning algorithm. Engineered features should capture additional information that is not easily apparent in the original feature set.
Feature selection: The process of selecting the key subset of features to reduce the dimensionality of the training problem.
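As a small illustration of feature engineering on text data, simple length-based features like the ones derived later in this notebook can be built directly from the raw descriptions (the values below are illustrative, not from the dataset):

```python
import pandas as pd

# Toy frame standing in for the ticket data (illustrative values only).
df = pd.DataFrame({'text': ['password reset needed', 'outlook crash', 'vpn down down']})

# Engineered features: word count and unique-word count per ticket.
df['num_wds'] = df['text'].str.split().str.len()
df['uniq_wds'] = df['text'].str.split().apply(lambda ws: len(set(ws)))
print(df)
```

Neither feature appears in the original columns, yet both capture signal (e.g. very short tickets behave differently), which is exactly what feature engineering is about.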
Seaborn provides a high-level interface for drawing attractive and informative statistical graphics. Data visualization is an important part of analysis, since it allows even non-programmers to decipher trends and patterns.
Here we start by identifying the basic traits of the data, that is, its rows and columns.
#Shape & Size of the dataset
print('No of rows:\033[1m', tickets_df.shape[0], '\033[0m')
print('No of cols:\033[1m', tickets_df.shape[1], '\033[0m')
#info on dataset
tickets_df.info()
Short description: the ticket title or a short description of the issue. Sometimes the issue can be understood from the short description alone.
Description: Detailed explanation of issue and the scenario.
Caller: the person who raised the ticket, or on whose behalf it was raised.
Assignment group: The group/category to which the ticket is assigned.
#detail info on dataset
tickets_df.describe().T
# Find out the null value counts in each column
tickets_df.isnull().sum()
We observe that not all columns have 8500 non-null values, which means there are null values in the data that we have to take care of. We can also observe that the data is highly imbalanced and skewed.
Total records: 8500 & Total Attributes: 4
Since our goal is automatic ticket assignment, the outcome does not depend on the caller; a caller can raise a ticket for any issue he or she is facing. Therefore, we can ignore the Caller attribute/feature in the detailed analysis.
There are 8 records with null value in short description and 1 record with null value in description.
One user/caller has raised 810 tickets.
There are 74 unique groups; GRP_0 has been assigned 3976 tickets, the maximum share at around ~40%.
The most frequent description is just the word 'the', which we also have to take care of.
The non-null counts of Short description and Description do not match those of Caller and Assignment group.
Password reset is one of the most frequently occurring ticket topics.
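The observations above can be checked programmatically. A sketch on a toy frame (the column names match the dataset; the values are illustrative):

```python
import pandas as pd

# Toy stand-in for tickets_df with the same columns (values illustrative).
toy = pd.DataFrame({
    'Short description': ['password reset', None, 'login issue', 'job failed'],
    'Description': ['reset please', 'outlook crash', None, 'job abended'],
    'Caller': ['u1', 'u1', 'u2', 'u3'],
    'Assignment group': ['GRP_0', 'GRP_0', 'GRP_0', 'GRP_8'],
})

nulls = toy.isnull().sum()                          # nulls per column
top_caller = toy['Caller'].value_counts().iloc[0]   # busiest caller's ticket count
grp_share = toy['Assignment group'].value_counts(normalize=True)['GRP_0']
print(nulls.to_dict(), top_caller, round(grp_share, 2))
```

Running the same three lines against `tickets_df` reproduces the figures quoted in the observations (8 + 1 nulls, 810 tickets for the top caller, ~40% share for GRP_0).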
# Check the records with null values
tickets_df[pd.isnull(tickets_df).any(axis=1)]
Now we drop the Caller attribute from the dataset, since it does not affect our task of automatic ticket assignment, and then find the unique groups present in the data.
# Drop the 'Caller' attribute, since it doesn't make an impact in our task of automatic ticket assignment
df_incidents = tickets_df.drop('Caller',axis=1)
#Unique Groups
unique_grp = df_incidents['Assignment group'].unique()
unique_grp
Here we create a dataframe showing the percentage of tickets present in each group.
df_inc = df_incidents['Assignment group'].value_counts().reset_index()
df_inc['percentage'] = (df_inc['Assignment group']/df_inc['Assignment group'].sum())*100
df_inc.head()
# Plot to visualize the percentage data distribution across different groups
sns.set(style="whitegrid")
plt.figure(figsize=(20,5))
ax = sns.countplot(x="Assignment group", data=df_incidents, order=df_incidents["Assignment group"].value_counts().index)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
for p in ax.patches:
    ax.annotate(str(format(p.get_height()/len(df_incidents.index)*100, '.2f')+"%"), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='bottom', rotation=90, xytext=(0, 10), textcoords='offset points')
The plot above shows the percentage distribution of data across the different groups. The table below shows the top 20 groups with the most records.
df_top_20 = df_incidents['Assignment group'].value_counts().nlargest(20).reset_index()
df_top_20
plt.figure(figsize=(12,6))
bars = plt.bar(df_top_20['index'],df_top_20['Assignment group'])
plt.title('Top 20 Assignment groups with highest number of Tickets')
plt.xlabel('Assignment Group')
plt.xticks(rotation=90)
plt.ylabel('Number of Tickets')
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x(), yval + .005, yval)
plt.tight_layout()
plt.show()
Here we have the plot of the top 20 groups that have the data assigned to them.
Below, we look at the 20 groups that have the fewest tickets assigned to them.
df_bottom_20 = df_incidents['Assignment group'].value_counts().nsmallest(20).reset_index()
df_bottom_20
plt.figure(figsize=(12,6))
bars = plt.bar(df_bottom_20['index'],df_bottom_20['Assignment group'])
plt.title('Bottom 20 Assignment groups with small number of Tickets')
plt.xlabel('Assignment Group')
plt.xticks(rotation=90)
plt.ylabel('Number of Tickets')
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x(), yval + .005, yval)
plt.tight_layout()
plt.show()
Here we have the plot of the bottom 20 groups that have the data assigned to them.
df_bins = pd.DataFrame(columns=['Description','Ticket Count'])
one_ticket = {'Description': '1 ticket', 'Ticket Count': len(df_inc[df_inc['Assignment group'] < 2])}
_2_5_ticket = {'Description': '2-5 tickets', 'Ticket Count': len(df_inc[(df_inc['Assignment group'] > 1) & (df_inc['Assignment group'] < 6)])}
_6_10_ticket = {'Description': '6-10 tickets', 'Ticket Count': len(df_inc[(df_inc['Assignment group'] > 5) & (df_inc['Assignment group'] < 11)])}
_11_20_ticket = {'Description': '11-20 tickets', 'Ticket Count': len(df_inc[(df_inc['Assignment group'] > 10) & (df_inc['Assignment group'] < 21)])}
_21_50_ticket = {'Description': '21-50 tickets', 'Ticket Count': len(df_inc[(df_inc['Assignment group'] > 20) & (df_inc['Assignment group'] < 51)])}
_51_100_ticket = {'Description': '51-100 tickets', 'Ticket Count': len(df_inc[(df_inc['Assignment group'] > 50) & (df_inc['Assignment group'] < 101)])}
_100_ticket = {'Description': '>100 tickets', 'Ticket Count': len(df_inc[df_inc['Assignment group'] > 100])}
# Append the rows to the dataframe
df_bins = df_bins.append([one_ticket, _2_5_ticket, _6_10_ticket, _11_20_ticket,
                          _21_50_ticket, _51_100_ticket, _100_ticket], ignore_index=True)
df_bins
In the dataframe above, we count how many groups have tickets assigned to them in each specific range. For example, there are 6 groups with just 1 ticket assigned to them.
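The same binning can be done more idiomatically with `pd.cut`. A sketch on illustrative per-group ticket counts (the bin edges match those used above):

```python
import pandas as pd

# Illustrative per-group ticket counts (stand-in for df_inc['Assignment group']).
counts = pd.Series([1, 1, 3, 8, 15, 40, 90, 250])

bins = [0, 1, 5, 10, 20, 50, 100, float('inf')]
labels = ['1 ticket', '2-5', '6-10', '11-20', '21-50', '51-100', '>100']
bin_counts = pd.cut(counts, bins=bins, labels=labels).value_counts().sort_index()
print(bin_counts)
```

`pd.cut` keeps the bin edges in one place, so there is no risk of the off-by-one mistakes that hand-written `>`/`<` chains invite.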
df_incidents[df_incidents['Short description'].isnull()]
df_incidents[df_incidents['Description'].isnull()]
To eliminate the null values, we replace them with an empty string.
#Replace NaN values in Short Description and Description columns
df_incidents['Short description'] = df_incidents['Short description'].replace(np.nan, '', regex=True)
df_incidents['Description'] = df_incidents['Description'].replace(np.nan, '', regex=True)
df_incidents.describe().T
Here we join the two columns into a new column. Because of the null replacement above, some rows now contain only an empty string in the Short description or Description column; by concatenating the two columns we can build the vocabulary from a single combined text field.
#Concatenate Short Description and Description columns
df_incidents['New_Description'] = df_incidents['Short description'] + ' ' +df_incidents['Description']
We do this concatenation here, to help with data pre-processing & data cleansing.
Later we will remove the repeated words in each combined description.
df_incidents.head()
A wordcloud is an image composed of words used in a particular text or subject, in which the size of each word indicates its frequency or importance. Here in the below wordcloud, we intend to do the same.
def f_word_cloud(column):
    comment_words = ' '
    stopwords = set(STOPWORDS)
    # Iterate through the text column
    for val in column:
        # Typecast each val to string
        val = str(val)
        # Split the value into tokens
        tokens = val.split()
        # Convert each token into lowercase
        for i in range(len(tokens)):
            tokens[i] = tokens[i].lower()
        for words in tokens:
            comment_words = comment_words + words + ' '
    wordcloud = WordCloud(width=800, height=800,
                          background_color='black',
                          stopwords=stopwords,
                          min_font_size=10).generate(comment_words)
    return wordcloud
Wordcloud image of the description.
from wordcloud import WordCloud, STOPWORDS
wordcloud = f_word_cloud(df_incidents.New_Description)
# plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
Let us view the word clouds of the top 4 assignment groups to see the kinds of tickets assigned to them.
Word Cloud for tickets with Assignment group 'GRP_0'.
wordcloud = f_word_cloud(df_incidents[df_incidents['Assignment group']=='GRP_0'].New_Description)
# plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
Word Cloud for tickets with Assignment group 'GRP_8'.
wordcloud = f_word_cloud(df_incidents[df_incidents['Assignment group']=='GRP_8'].New_Description)
# plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
GRP_8 seems to have tickets related to outage, job failures, monitoring tool etc.
Word Cloud for tickets with Assignment group 'GRP_12'
wordcloud = f_word_cloud(df_incidents[df_incidents['Assignment group']=='GRP_12'].New_Description)
# plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
GRP_12 contains tickets related to system issues such as disk space, and network issues such as Citrix problems and connectivity timeouts.
Word Cloud for tickets with Assignment group 'GRP_24'.
wordcloud = f_word_cloud(df_incidents[df_incidents['Assignment group']=='GRP_24'].New_Description)
# plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
GRP_24 - these tickets are mainly in German and need to be translated to English before being passed to our model.
It seems there are a few more tickets with descriptions in other languages as well, probably German.
Text encoding transforms words into numbers and texts into number vectors. In the given dataset we also found that many entries/records are in multiple languages, so we need to translate them into one easily understandable language; here we choose English. There are also special characters in some records that need to be handled.
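A minimal sketch of what "words into numbers" means, in pure Python (the Keras `Tokenizer` imported earlier does the same thing with more options such as out-of-vocabulary handling):

```python
# Build a word-index vocabulary and encode sentences as integer sequences.
texts = ['password reset request', 'reset outlook password']

vocab = {}
for text in texts:
    for word in text.split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1  # ids start at 1; 0 is reserved for padding

sequences = [[vocab[w] for w in t.split()] for t in texts]
print(vocab)       # {'password': 1, 'reset': 2, 'request': 3, 'outlook': 4}
print(sequences)   # [[1, 2, 3], [2, 4, 1]]
```

Each ticket description becomes a sequence of integers, which is the form a neural model can consume after padding to a common length.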
df_incidents[df_incidents['Assignment group']=='GRP_24'].New_Description
df_incidents.shape
Language Detection
# Encode the string to ASCII, dropping characters the language-detection API cannot handle.
def fn_decode_to_ascii(df):
    text = df.encode().decode('utf-8').encode('ascii', 'ignore')
    return text.decode("utf-8")
df_incidents['New_Description'] = df_incidents['New_Description'].apply(fn_decode_to_ascii)
In the below code, we are now using langdetect library to detect the languages used in the data set.
from langdetect import detect

# Detect the language of each record; 'no' is the sentinel returned when
# detection fails (note: it collides with the ISO code for Norwegian).
def fn_lan_detect(df):
    try:
        return detect(df)
    except:
        return 'no'
df_incidents['language'] = df_incidents['New_Description'].apply(fn_lan_detect)
#Languages detected
df_incidents["language"].value_counts()
x = df_incidents["language"].value_counts()
x=x.sort_index()
plt.figure(figsize=(10,6))
ax= sns.barplot(x.index, x.values, alpha=0.8)
plt.title("Distribution of text by language")
plt.ylabel('number of records')
plt.xlabel('Language')
rects = ax.patches
labels = x.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')
plt.show();
Language Translation
We can see that most of the tickets are in English, followed by tickets in German. We need to translate these into English; we will use the googletrans package for the translation.
import googletrans
from googletrans import Translator
translator = Translator()
# Function to translate text to English.
def fn_translate(df, lang):
    try:
        if lang == 'en':
            return df
        else:
            return translator.translate(df, dest='en', src=lang).text
    except:
        return df
df_incidents['English_Description'] = df_incidents.apply(lambda x: fn_translate(x['New_Description'], x['language']), axis=1)
The Google Translate API is used to translate the non-English text. However, the API imposes limits, and garbage values and non-ASCII symbols prevent proper translation of some records.
So the translation was done in 2 batches:
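A hedged sketch of how batched translation with graceful fallback can be structured. The batch size and the pluggable `translate_fn` hook are assumptions for illustration, not the project's actual batching code; in practice `translate_fn` would wrap googletrans' `Translator().translate`.

```python
# Generic batching helper for rate-limited translation APIs.
def translate_in_batches(texts, translate_fn, batch_size=100):
    """Translate texts in fixed-size batches, keeping the original text on failure."""
    out = []
    for start in range(0, len(texts), batch_size):
        for text in texts[start:start + batch_size]:
            try:
                out.append(translate_fn(text))
            except Exception:
                out.append(text)  # fall back to the untranslated text
    return out

# Usage with a dummy translator that fails on one record:
def fake_translate(t):
    if 'bad' in t:
        raise ValueError('untranslatable')
    return t.upper()

print(translate_in_batches(['hallo', 'bad bytes', 'guten tag'], fake_translate, batch_size=2))
# → ['HALLO', 'bad bytes', 'GUTEN TAG']
```

Keeping the untranslated text on failure means a few garbage records cannot abort a whole batch.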
df_incidents[df_incidents["Short description"].str.contains("account lock")]["Assignment group"].value_counts()
df_incidents[df_incidents["Short description"].str.contains("oneteam")]["Assignment group"].value_counts()
df_incidents.head(10)
df_incidents[df_incidents.language=='no']['Assignment group'].unique()
lang_list = df_incidents.language.unique().tolist()
list_groups=[]
for lang in lang_list:
if not lang=='en':
#print(lang)
list_groups=list_groups+(df_incidents[df_incidents.language==lang]['Assignment group'].unique().tolist())
print(len(list(set(list_groups))))
len(list(set(list_groups)))
list_groups
df_incidents.to_csv("inc_tranlated.csv",index=False)
df_incidents.head()
df_tranlated_text = pd.read_csv('inc_tranlated.csv',encoding='utf-8')
df_tranlated_inc = df_tranlated_text.drop(['Short description','Description','New_Description'],axis=1)
df_tranlated_inc.English_Description=df_tranlated_inc.English_Description.astype(str)
df_tranlated_inc.head()
df_tranlated_text.tail()
Exploring the distribution of the different languages in the dataset.
#Unique Languages & Unique grps
det_lang = df_tranlated_text['language']
det_lang2 = np.array(det_lang)
det_lang2 = np.unique(det_lang)
det_lang2
print("Total No: of languages detected: ", det_lang2.size)
# Value Counts of Language distribution
lang_valcount = df_tranlated_text['language'].value_counts()
print(lang_valcount)
det_lang3 = np.array(det_lang)
occurrences = np.count_nonzero(det_lang3 != 'en')
print("Non-English language count: ", occurrences)
# Find Language Distribution in Groups
grplang_df = pd.DataFrame(df_tranlated_text['Assignment group'].unique(),columns=['AsgnGrp'])
asgLngGrp = []
for ct2 in grplang_df.itertuples():
    strVar = []
    for ct1 in df_tranlated_text.itertuples():
        if ct2.AsgnGrp == ct1._3:
            strVar.append(ct1.language)
    strArrVar = str(np.unique(strVar))
    asgLngGrp.append(strArrVar)
Group-wise Language Distribution
grplang_df['AsgnLang'] = asgLngGrp
pd.set_option('display.max_rows', None)
print(grplang_df)
Observation:
From above table and bar chart, we can observe that the language is distributed across groups and are not specific to certain groups alone.
#Pie Chart of target group
df_tranlated_text['Assignment group'].value_counts().plot(kind = 'pie', autopct = '%.0f%%', labels = grplang_df['AsgnGrp'], figsize = (30, 30))
# Wordcloud before data cleaning & after language translation
wordcloudImg2 = WordCloud().generate(str(df_tranlated_text["English_Description"]))
plt.figure(figsize=(30,20))
plt.imshow(wordcloudImg2, interpolation='bilinear')
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
df_tranlated_inc.shape
import string
import spacy
from nltk import tokenize
Lowercasing: converting words into lowercase (NLU -> nlu). Words with the same meaning, like nlp and NLP, would otherwise be treated as non-identical words in the vector-space model.
Stop words removal: the most frequently used words that carry no significance when distinguishing documents (a, an, the, etc.) are removed.
Contextual conversational words removal: in our case, words like 'received from', 'to', 'regards', 'subject', and 'email address', identified as standard email-conversation boilerplate, are removed.
Punctuation removal: the text contains several punctuation marks, which are often unnecessary as they add no value or meaning for the NLP model.
Other steps: further cleaning can be performed based on the data; removing URLs, HTML tags, numbers, and hashtags are also used here.
# Define a function to clean the data
import re

# Email-header tokens and boilerplate phrases observed in the ticket text.
# Longer phrases come before their substrings (e.g. "email address:" before "email:").
HEADER_TOKENS = [
    "received from:", "email address:", "email:", "from:", "to:", "subject:",
    "sent:", "ic:", "cc:", "bcc:", "cid:image", "name:", "language:",
    "customer number:", "telephone:", "summary:", "importance:",
]
BOILERPLATE = [
    "this message was sent from an unmonitored email address",
    "please do not reply to this message",
    "monitoring_tool@company.com", "monitoringtool",
    "select the following link to view the disclaimer in an alternate language",
    "description of problem:", "description problem", "steps taken so far",
    "steps taken far", "customer job title", "sales engineer contact",
    "please do the needful", "please note that", "please find below",
    "date and time", "kindly refer mail", "sincerely", "company inc",
    "gmail.com", "company.com", "microsoftonline.com", "company.onmicrosoft.com",
    "hello", "hallo", "hi it team", "hi team", "best regards", "kind regards",
    "regards", "good morning", "please", "kindly",
]

def clean_data(text):
    text = text.lower()
    text = ' '.join(text.split())
    # Strip email headers and conversational boilerplate
    for phrase in HEADER_TOKENS + BOILERPLATE:
        text = text.replace(phrase, ' ')
    # Remove the standalone greeting 'hi' (word-bounded, so 'this' is untouched)
    text = re.sub(r'\bhi\b', ' ', text)
    # Remove email addresses
    text = re.sub(r'\S*@\S*\s?', '', text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Remove new line characters
    text = re.sub(r'\n', ' ', text)
    # Remove hashtag symbol while keeping the hashtag text
    text = re.sub(r'#', '', text)
    text = re.sub(r'&;?', 'and', text)
    # Remove HTML special entities (e.g. &amp;)
    text = re.sub(r'\&\w*;', '', text)
    # Remove hyperlinks
    text = re.sub(r'https?:\/\/.*\/\w*', '', text)
    # Remove characters beyond the Basic Multilingual Plane
    text = ''.join(c for c in text if c <= '\uFFFF')
    # Keep only ASCII alphanumerics (also collapses extra spaces)
    text = ' '.join(re.sub('[^0-9A-Za-z]', ' ', text).split())
    # Remove single-character tokens
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)
    text = re.sub(' +', ' ', text)
    return text.strip()
df_tranlated_inc['cleaned_description'] = df_tranlated_inc['English_Description'].apply(lambda x: clean_data(x))
df_tranlated_inc.drop(['English_Description'],axis=1,inplace=True)
df_tranlated_inc['cleaned_description'].head()
Language Translation after cleaning & before stop words removal & tokenization
df_tranlated_inc['cleaned_description'] = df_tranlated_inc.apply(lambda x: fn_translate(x['cleaned_description'], x['language']), axis=1)
df_tranlated_inc.head()
## Removal of Stop Words
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')
df_tranlated_inc['cleaned_description'] = df_tranlated_inc['cleaned_description'].apply(lambda x: " ".join(x for x in str(x).split() if x not in stop))
df_tranlated_inc['cleaned_description'].head()
Remove duplicates in combined descriptions
desc_Arr = []
for tk1 in df_tranlated_inc.itertuples():
    texArr = list(tk1.cleaned_description.split(" "))
    str1 = " "
    # Note: np.unique also sorts the words, so the original word order is lost.
    texStr = str1.join(np.unique(texArr))
    desc_Arr.append(texStr)
This step removes the duplicate words introduced by concatenating the short description and long description columns of the dataset.
These duplicate words would inflate the word counts and could affect the model-building steps and the performance of the models.
desc_Arr[4]
df_tranlated_inc['cleaned_description'] = desc_Arr
df_tranlated_inc.head()
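Note that `np.unique` sorts the tokens, so the de-duplication above loses the original word order. If order matters for a downstream sequence model, an order-preserving alternative (a sketch using `dict.fromkeys`) gives the same vocabulary while keeping the first occurrence of each word:

```python
# Order-preserving removal of duplicate words in a description.
def dedupe_words(text):
    # dict.fromkeys keeps insertion order (Python 3.7+), unlike np.unique which sorts
    return ' '.join(dict.fromkeys(text.split()))

print(dedupe_words('password reset password account reset'))
# → 'password reset account'
```

For bag-of-words features the sorted version is equivalent; for LSTM-style models the order-preserving variant is safer.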
N-grams of texts are extensively used in text mining and natural language processing tasks.
An n-gram is a contiguous sequence of N items from a given sample of text or speech; in other words, a set of co-occurring words within a given window. The items can be phonemes, syllables, letters, words, or base pairs, depending on the application. For words: a unigram is a single word, a bigram a two-word phrase, and a trigram a three-word phrase. We will use scikit-learn's CountVectorizer to derive the n-grams, via a generic method.
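To make the definition concrete, a minimal pure-Python sketch of word n-grams (CountVectorizer does this internally, plus counting and stop-word handling):

```python
# Extract word n-grams from a sentence by sliding a window of size n.
def ngrams(text, n):
    words = text.split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = 'password reset not working'
print(ngrams(sentence, 1))  # unigrams: ['password', 'reset', 'not', 'working']
print(ngrams(sentence, 2))  # bigrams: ['password reset', 'reset not', 'not working']
```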
# Generic function to derive the top-N n-grams from the corpus
from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_ngrams(corpus, top_n=None, ngram_range=(1, 1), stopwords=None):
    vec = CountVectorizer(ngram_range=ngram_range, stop_words=stopwords).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:top_n]
# Top Unigrams after removing stop words
top_n = 50
ngram_range = (1,1)
uni_grams_sw = get_top_n_ngrams(df_tranlated_inc.cleaned_description, top_n, ngram_range, stopwords=stop)
unigram_df = pd.DataFrame(uni_grams_sw, columns = ['Summary' , 'count'])
figure = unigram_df.groupby('Summary').sum()['count'].sort_values(ascending=False)
figure.head(10)
# Top Bigrams after removing stop words
top_n = 50
ngram_range = (2,2)
bi_grams_sw = get_top_n_ngrams(df_tranlated_inc.cleaned_description, top_n, ngram_range, stopwords=stop)
bigrams_df = pd.DataFrame(bi_grams_sw, columns = ['Summary' , 'count'])
figure2 = bigrams_df.groupby('Summary').sum()['count'].sort_values(ascending=False)
figure2.head(10)
# Wordcloud after data cleaning
wordcloudImg3 = WordCloud().generate(str(df_tranlated_inc["cleaned_description"]))
plt.figure(figsize=(30,20))
plt.imshow(wordcloudImg3, interpolation='bilinear')
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
1. Lemmatization is the process of reducing a word to its root form by grouping together its different inflected forms so they can be analysed as a single item.
2. It helps to reduce variations of the same word, thereby reducing the corpus of words to be included in the model.
Lemmatization returns the base or dictionary form of a word, known as the lemma. It considers the context of the word and shortens it to its root form based on the dictionary definition, which is important when cleaning the data of all inflected forms of a given root.
# Lemmatization
import nltk
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nlp = spacy.load('en', disable=['parser', 'ner'])
allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']
def lemmatize_text(text):
    doc = nlp(text)
    return ' '.join([token.lemma_ for token in doc if token.lemma_ != '-PRON-'])
df_tranlated_inc["cleaned_description"] = df_tranlated_inc["cleaned_description"].apply(lemmatize_text)
df_tranlated_inc['cleaned_description'].head()
# Wordcloud for corpus after data cleaning, removal of stop words & lemmatization
from wordcloud import WordCloud
def generate_word_cloud(corpus, x):
    wordcloudImg4 = WordCloud(width=500, height=500,
                              background_color='black',
                              stopwords=stop,
                              min_font_size=10).generate(corpus)
    # Plot the WordCloud image
    plt.figure(figsize=(12, 12), facecolor=None)
    plt.imshow(wordcloudImg4, interpolation="bilinear")
    plt.axis("off")
    plt.title("Most common words of {}".format(x))
    plt.tight_layout(pad=0)
    plt.show()
generate_word_cloud(str(df_tranlated_inc['cleaned_description']),1)
valueCts = df_tranlated_inc['Assignment group'].value_counts().sort_values(ascending=False).index
valueCts
# Generate wordclouds for the top 16 assignment groups
for i in range(16):
    generate_word_cloud(' '.join(df_tranlated_inc[df_tranlated_inc['Assignment group'] == valueCts[i]].cleaned_description.str.strip()), valueCts[i])
Observations
It is clear from the n-gram analysis and the word cloud that in the dataset, most issues are related to:
Sample analysis on GRP_0: it is the most frequent group, and most of the tickets assigned to it show that it deals largely with maintenance problems such as password resets, account locks, login issues, and ticket updates.
Many GRP_0 tickets could be handled without human intervention by putting automation scripts/mechanisms in place for these common maintenance issues. This would lower the inflow of service tickets that need human attention, saving resource-hour effort and reducing the cost of man hours.
df_tranlated_inc['num_wds'] = df_tranlated_inc['cleaned_description'].apply(lambda x: len(x.split()))
df_tranlated_inc['num_wds'].mean()
print(df_tranlated_inc['num_wds'].max())
print(df_tranlated_inc['num_wds'].min())
len(df_tranlated_inc[df_tranlated_inc['num_wds']==0])
Here we remove those records from our dataframe for which the number of words in the cleaned description is less than 2.
Hence we keep only those records where 'num_wds' is greater than 1.
df_tranlated_inc= df_tranlated_inc[df_tranlated_inc['num_wds']>1]
df_tranlated_inc.shape
The new dataframe now has 8424 records.
76 records had fewer than 2 words and were therefore removed.
print(df_tranlated_inc['num_wds'].max())
print(df_tranlated_inc['num_wds'].min())
def avg_word(sentence):
    words = sentence.split()
    return sum(len(word) for word in words) / len(words)
df_tranlated_inc['avg_word'] = df_tranlated_inc['cleaned_description'].apply(lambda x: avg_word(str(x)))
df_tranlated_inc.head()
Let's visualize the distribution of the description word counts to see how much outliers might skew our average.
ax=df_tranlated_inc['num_wds'].plot(kind='hist', bins=50, fontsize=14, figsize=(12,10))
ax.set_title('Description Length in Words\n', fontsize=20)
ax.set_ylabel('Frequency', fontsize=18)
ax.set_xlabel('Number of Words', fontsize=18);
Number of unique words in each incident description
df_tranlated_inc['uniq_wds'] = df_tranlated_inc['cleaned_description'].str.split().apply(lambda x: len(set(x)))
df_tranlated_inc['uniq_wds'].head()
Average (mean) number of unique words per incident, along with the minimum and maximum unique word counts:
print("Mean: ",df_tranlated_inc['uniq_wds'].mean())
print("Min: ",df_tranlated_inc['uniq_wds'].min())
print("Max: ",df_tranlated_inc['uniq_wds'].max())
ax=df_tranlated_inc['uniq_wds'].plot(kind='hist', bins=50, fontsize=14, figsize=(12,10))
ax.set_title('Unique Words Per Incident\n', fontsize=20)
ax.set_ylabel('Frequency', fontsize=18)
ax.set_xlabel('Number of Unique Words', fontsize=18);
When we plot this into a chart, we can see that the distribution of unique words is not heavily skewed.
Mean Number of Words in tickets per Assignment Group
assign_grps = df_tranlated_inc.groupby('Assignment group')
ax=assign_grps['num_wds'].aggregate(np.mean).plot(kind='bar', fontsize=14, figsize=(20,10))
ax.set_title('Mean Number of Words in tickets per Assignment Group\n', fontsize=20)
ax.set_ylabel('Mean Number of Words', fontsize=18)
ax.set_xlabel('Assignment Group', fontsize=18);
Mean Number of Unique Words in tickets per Assignment Group
ax=assign_grps['uniq_wds'].aggregate(np.mean).plot(kind='bar', fontsize=14, figsize=(20,10))
ax.set_title('Mean Number of Unique Words per tickets in Assignment Group\n', fontsize=20)
ax.set_ylabel('Mean Number of Unique Words', fontsize=18)
ax.set_xlabel('Assignment Group', fontsize=18);
Finally, let’s look at the most common words over the entire corpus.
from collections import Counter
wd_counts = Counter()
for i, row in df_tranlated_inc.iterrows():
    wd_counts.update(row['cleaned_description'].split())
wd_counts.most_common(20)
df_tranlated_inc.tail()
Tokenization is a process that splits an input sequence into so-called tokens, where a token can be a word, sentence, paragraph, etc.
import nltk
# Tokenizing the training and the test set
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
df_tranlated_inc['token_desc'] = df_tranlated_inc['cleaned_description'].apply(lambda x: tokenizer.tokenize(x))
df_tranlated_inc['token_desc'].head()
# After preprocessing, the text format
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text
df_tranlated_inc['token_desc'] = df_tranlated_inc['token_desc'].apply(lambda x : combine_text(x))
df_tranlated_inc.describe().T
df_tranlated_inc.info()
df_tranlated_inc.to_csv("cleaned_data.csv",index=False)
df_tranlated_inc = pd.read_csv('cleaned_data.csv')
df_tranlated_inc.head()
Let's create a copy of the clean df for modeling purpose.
ticket_df = df_tranlated_inc.copy()
ticket_df.head()
# Recalling the top 10 groups with the highest ticket count
df_top_10 = ticket_df['Assignment group'].value_counts().nlargest(10).reset_index()
df_top_10
Topic modeling provides methods for automatically organizing, understanding, searching, and summarizing large electronic archives.
For example, say a document belongs to the topics food, dogs and health. If a user queries "dog food", they might find that document relevant because it covers those topics (among others). We can determine its relevance to the query without even going through the entire document.
Therefore, by annotating documents based on the topics predicted by the modeling method, we are able to optimize our search process.
Here, for topic modeling, we are using Gensim.
Gensim ("Generate Similar") is a popular open-source natural language processing (NLP) library used for unsupervised topic modeling. It uses top academic models and modern statistical machine learning to perform various complex tasks.
Beyond these tasks, Gensim, implemented in Python and Cython, is designed to handle large text collections using data streaming and incremental online algorithms. This distinguishes it from machine learning packages that target only in-memory processing.
# Gensim
import gensim
import gensim.corpora as corpora
# Remove stemming (Snowball stemming); add lemmatisation using simple_preprocess from gensim
from gensim.utils import simple_preprocess
from gensim.models.ldamodel import LdaModel
from gensim.models import CoherenceModel
# spacy for lemmatization
import spacy
# Plotting tools
import pyLDAvis
import pyLDAvis.gensim
# gensim's simple_preprocess needs its input as strings
combined_text = ticket_df.cleaned_description.values.tolist()
# Convert each sentence of the combined text into a list of words; simple_preprocess tokenizes internally
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)  # deacc=True removes punctuation
data_words = list(sent_to_words(combined_text))
Note : Bigram is 2 consecutive words in a sentence. Trigram is 3 consecutive words in a sentence.
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]
def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]
# Form Bigrams
data_words_bigrams = make_bigrams(data_words)
wordclouds=' '.join(map(str, data_words_bigrams))
#Copying to new dataframe to create wordclouds on target class
new_df = ticket_df.copy()
new_df['words'] = data_words_bigrams
new_df.head()
new_df.tail()
#Sorting based on frequency of target class Assignment group
value = new_df['Assignment group'].value_counts().sort_values(ascending=False).index
value
Latent Dirichlet Allocation (LDA) is one of the most popular topic modeling methods. Each document is made up of various words, and each topic also has various words belonging to it. The aim of LDA is to find the topics a document belongs to, based on the words in it.
# Create Dictionary
id2word = corpora.Dictionary(data_words_bigrams)
# Create Corpus from post clean data
texts = data_words_bigrams
# Term Document Frequency and Bag of words
corpus = [id2word.doc2bow(text) for text in texts]
# Build LDA model
lda_model = LdaModel(corpus=corpus,id2word=id2word,num_topics=7,random_state=200,update_every=1,chunksize=800,passes=10,alpha='auto',per_word_topics=True)
#top 7 topics from the corpus
from pprint import pprint
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]
texts=data_words_bigrams
# Compute Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus)) # a measure of how good the model is. lower the better.
In natural language processing, perplexity is a way of evaluating language models.
A language model is a probability distribution over entire sentences or texts.
A low perplexity indicates the probability distribution is good at predicting the sample.
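As a small self-contained illustration (toy numbers, not tied to this corpus): perplexity is the exponential of the average negative log-likelihood per token, so a model that assigns probability 1 to every observed token scores 1, and a uniform model over V words scores V.

```python
import math

# Toy example: a model assigns these probabilities to 4 observed tokens.
token_probs = [0.25, 0.1, 0.5, 0.05]

# Average negative log-likelihood per token
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity: exponential of the average negative log-likelihood
perplexity = math.exp(avg_nll)

print(perplexity)
```

The lower this number, the better the model predicts the sample, which is why we want `lda_model.log_perplexity(corpus)` (a log-scale bound) to be low.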
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.
# Visualize the topics
pyLDAvis.enable_notebook()
LDAvis_prepared = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
LDAvis_prepared
Visualize data distribution in the dataset
#Lets get the visual representation
descending_order = ticket_df['Assignment group'].value_counts().sort_values(ascending=False).index
plt.subplots(figsize=(22,5))
ax=sns.countplot(x='Assignment group', data=ticket_df)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
plt.tight_layout()
plt.show()
Even after the data cleaning process, we can observe a lot of imbalance in the data distribution.
This imbalance demands further processing of the dataset, with measures such as resampling, to build better-performing prediction models.
# shape before resampling
ticket_df.shape
ticket_df.head()
Overview of this step:
Traditional machine learning algorithms for classification will be tried against the vectorized features generated with TF-IDF.
Compare model accuracies to select the best-performing model.
Analyse and check for possible improvements, if required.
import warnings
# Traditional Modeling
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from xgboost.sklearn import XGBClassifier
# Tools & Evaluation metrics
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report, auc, roc_auc_score
from sklearn.metrics import roc_curve, accuracy_score, precision_recall_curve, f1_score, recall_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
def multiclass_logloss(actual, predicted, eps=1e-15):
    """Multi class version of Logarithmic Loss metric.
    :param actual: Array containing the actual target classes
    :param predicted: Matrix with class predictions, one probability per class
    """
    # Convert 'actual' to a binary array if it's not already:
    if len(actual.shape) == 1:
        actual2 = np.zeros((actual.shape[0], predicted.shape[1]))
        for i, val in enumerate(actual):
            actual2[i, val] = 1
        actual = actual2
    clip = np.clip(predicted, eps, 1 - eps)
    rows = actual.shape[0]
    vsota = np.sum(actual * np.log(clip))
    return -1.0 / rows * vsota
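As a self-contained sanity check on the formula above (toy labels and probabilities, not taken from the dataset), the one-hot / clip / average computation should agree with scikit-learn's `log_loss`:

```python
import numpy as np
from sklearn.metrics import log_loss

actual = np.array([0, 2, 1])                  # integer class labels for 3 samples
predicted = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.3, 0.6],
                      [0.2, 0.6, 0.2]])       # one probability per class, rows sum to 1

# Manual multi-class log loss: one-hot the labels, clip, then average
eps = 1e-15
onehot = np.eye(3)[actual]
clipped = np.clip(predicted, eps, 1 - eps)
manual = -np.mean(np.sum(onehot * np.log(clipped), axis=1))

print(manual, log_loss(actual, predicted, labels=[0, 1, 2]))
```

Both values should match to floating-point precision, which confirms the helper implements the standard multi-class log loss.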
log_cols=["Classifier","Training accuracy","Testing accuracy","f1-score","Recall"]
log = pd.DataFrame(columns=log_cols)
emptyArr = []
cbArr = []
import sys
# A method to train and test the model
def run_classification(Prediction_model, X_train, X_test, y_train, y_test, cbArr, arch_name=None, pipelineRequired=True, isDeepModel=False):
    clf = Prediction_model
    if pipelineRequired:
        clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', Prediction_model)])
    if isDeepModel:
        if cbArr:
            clf.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=128, verbose=True, callbacks=cbArr)
        else:
            clf.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=128, verbose=1)
        # predict from the classifier
        y_pred = clf.predict(X_test)
        y_predOriginal = clf
        y_pred = np.argmax(y_pred, axis=1)
        y_train_pred = clf.predict(X_train)
        y_train_pred = np.argmax(y_train_pred, axis=1)
    else:
        clf.fit(X_train, y_train)
        # predict from the classifier
        y_pred = clf.predict(X_test)
        y_predOriginal = clf
        y_train_pred = clf.predict(X_train)
    np.set_printoptions(threshold=np.inf)
    print('Prediction Model:', Prediction_model)
    print('-' * 80)
    print('Training accuracy: %.2f%%' % (accuracy_score(y_train, y_train_pred) * 100))
    print('Testing accuracy: %.2f%%' % (accuracy_score(y_test, y_pred) * 100))
    print('-' * 80)
    cm = confusion_matrix(y_test, y_pred)
    print(cm.shape)
    print('Confusion Matrix:\n')
    print(cm)
    print("\n")
    print('-' * 80)
    print('Classification report:\n %s' % (classification_report(y_test, y_pred)))
    predModName = str(Prediction_model).split("(")[0]
    log_entry = pd.DataFrame([[predModName,
                               accuracy_score(y_train, y_train_pred),
                               accuracy_score(y_test, y_pred),
                               f1_score(y_test, y_pred, average='weighted'),
                               recall_score(y_test, y_pred, average='weighted')]],
                             columns=log_cols)
    return log_entry, y_predOriginal
df_inc_sample2 = ticket_df[ticket_df['Assignment group'].map(ticket_df['Assignment group'].value_counts()) > 0]
x2 = ticket_df['cleaned_description']
y2 = ticket_df['Assignment group']
Traditional ML models considered: Naive Bayes, KNN, Linear SVM, Decision Tree, Random Forest, Logistic Regression, AdaBoost, Bagging and XGBoost.
# Create training and test datasets with 80:20 ratio
X_train2, X_test2, y_train2, y_test2 = train_test_split(x2,
y2,
test_size=0.20,
random_state=13)
print('\033[1mShape of the training set:\033[0m', X_train2.shape, y_train2.shape)
print('\033[1mShape of the test set:\033[0m', X_test2.shape, y_test2.shape)
# Copy of Original X_train2 & X_test2
X_trainOrg2 = X_train2
X_testOrg2 = X_test2
# Copy of Original of y_train2 & y_test2
y_trainOrg2 = y_train2
y_testOrg2 = y_test2
from sklearn import preprocessing
encoder = preprocessing.LabelEncoder()
# encoding train labels
y_train2 = encoder.fit_transform(y_trainOrg2)
# encoding test labels with the encoder fitted on the training labels
# (transform, not fit_transform, so the label-to-integer mapping stays consistent)
y_test2 = encoder.transform(y_testOrg2)
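As an aside on label encoding (a self-contained illustration with made-up group names): fitting a fresh LabelEncoder on the test labels can silently remap classes whenever the test split does not contain every class, which is why test labels should be transformed with the encoder fitted on the training labels.

```python
from sklearn.preprocessing import LabelEncoder

train_labels = ["GRP_0", "GRP_1", "GRP_2"]   # hypothetical group names
test_labels = ["GRP_0", "GRP_2"]             # GRP_1 happens to be absent from the test split

enc = LabelEncoder()
y_train = enc.fit_transform(train_labels)    # GRP_0 -> 0, GRP_1 -> 1, GRP_2 -> 2

# Refitting on the test labels silently remaps GRP_2 to 1 instead of 2:
bad = LabelEncoder().fit_transform(test_labels)

# Reusing the training encoder keeps the mapping consistent:
good = enc.transform(test_labels)

print(list(bad), list(good))
```

With the refit, GRP_2 would be scored against the wrong integer class, corrupting every accuracy number downstream.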
The Naive Bayes method is a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of conditional independence between every pair of features given the value of the class variable. A Naive Bayes classifier can be extremely fast compared to more sophisticated methods.
log_cols3=["Classifier","Training accuracy","Testing accuracy","f1-score","Recall"]
log3 = pd.DataFrame(columns=log_cols3)
emptyArr = []
cbArr = []
logTmp = []
logTmp, y_pred_MultinomialNB2 = run_classification(MultinomialNB(), X_train2, X_test2, y_train2, y_test2, emptyArr)
logTmp['Classifier'] = "Naive Bayes Model - without Sampling"
log3 = log3.append(logTmp)
log3
The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. Being a non-parametric method, it is often successful in classification situations where the decision boundary is very irregular.
logTmp = []
logTmp, y_pred_KNeighbors2 = run_classification(KNeighborsClassifier(), X_train2, X_test2, y_train2, y_test2, emptyArr)
logTmp['Classifier'] = "KNN Model - without Sampling"
log3 = log3.append(logTmp)
log3
Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection.
Linear SVM
The algorithm creates a line or a hyperplane which separates the data into classes.
The advantages of support vector machines are:
Effective in high dimensional spaces.
Still effective in cases where number of dimensions is greater than the number of samples.
Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
Versatile: different Kernel functions can be specified for the decision function.
# SVM with Linear kernel
logTmp = []
logTmp, y_pred_LinearSVC2 = run_classification(LinearSVC(), X_train2, X_test2, y_train2, y_test2, emptyArr)
logTmp['Classifier'] = "Linear SVM Model - without Sampling"
log3 = log3.append(logTmp)
log3
Decision Trees
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. DecisionTreeClassifier is a class capable of performing multi-class classification on a dataset.
logTmp = []
logTmp, y_pred_DecisionTree2 = run_classification(DecisionTreeClassifier(), X_train2, X_test2, y_train2, y_test2, emptyArr)
logTmp['Classifier'] = "Decision Trees Model - without Sampling"
log3 = log3.append(logTmp)
log3
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. Essentially, Random Forest is a good model if we want high performance with less need for interpretation.
from sklearn.ensemble import RandomForestClassifier
logTmp = []
logTmp, y_pred_RandomForest2 = run_classification(RandomForestClassifier(n_estimators=100), X_train2, X_test2, y_train2, y_test2, emptyArr)
logTmp['Classifier'] = "Random Forest Classifier Model - without Sampling"
log3 = log3.append(logTmp)
log3
Logistic Regression is a machine learning classification algorithm used to predict the probability of a categorical dependent variable. In its basic form the dependent variable is binary, coded as 1 (yes, success, etc.) or 0 (no, failure, etc.); scikit-learn's implementation extends it to multi-class problems such as ours.
from sklearn.linear_model import LogisticRegression
logTmp = []
logTmp, y_pred_LogReg2 = run_classification(LogisticRegression(), X_train2, X_test2, y_train2, y_test2, emptyArr)
logTmp['Classifier'] = "Logistic Regression Model - without Sampling"
log3 = log3.append(logTmp)
log3
An AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.
logTmp = []
dtc2 = DecisionTreeClassifier(max_depth=7)
logTmp, y_pred_AdaBoost2 = run_classification(AdaBoostClassifier(n_estimators= 200, base_estimator=dtc2, learning_rate=0.1, random_state=22), X_train2, X_test2, y_train2, y_test2, emptyArr)
logTmp['Classifier'] = "ADA Boost Classifier Model - without Sampling"
log3 = log3.append(logTmp)
log3
A Bagging classifier is an ensemble meta-estimator that fits base classifiers, each on a random subset of the original dataset, and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree) by introducing randomization into its construction procedure and then making an ensemble out of it.
logTmp = []
logTmp, y_pred_Bagging2 = run_classification(BaggingClassifier(n_estimators=100, max_samples= .7, bootstrap=True, oob_score=True, n_jobs=4, random_state=22), X_train2, X_test2, y_train2, y_test2, emptyArr)
logTmp['Classifier'] = "Bagging Classifier Model - without Sampling"
log3 = log3.append(logTmp)
log3
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the gradient boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many problems in a fast and accurate way.
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
logTmp = []
logTmp, y_pred_XGBoost2 = run_classification(XGBClassifier(n_estimators=100, max_depth=7, min_child_weight=6, colsample_bytree=0.8, subsample=0.8,n_jobs=4, learning_rate=0.1), X_train2, X_test2, y_train2, y_test2, emptyArr)
logTmp['Classifier'] = "XG Boost Classifier Model - without Sampling"
log3 = log3.append(logTmp)
log3
logObs3 = log3
logObs3.set_index(["Classifier"],inplace=True)
logObs3.sort_values(by=['f1-score'])
logObs3.sort_values(by=['f1-score']).plot(kind='barh',figsize=[15,10])
Observation:
We first analysed the dataset provided to us and understood the structure of the data - number of columns, fields, datatypes, etc.
We performed Exploratory Data Analysis to derive further insights and found that the data is highly imbalanced: around ~45% of the groups have fewer than 20 tickets.
A few tickets are in a foreign language such as German, and the data has a lot of noise in it - for example, tickets related to account setup are spread across multiple assignment groups. We performed data cleaning, Google translation and preprocessing.
In this comparison of different traditional ML models, we can observe a substantial difference between the accuracy on the training and test sets. The major reason is likely the imbalanced data distribution and the models' inability to learn the rare classes during training.
We need to check whether these issues can be handled with resampled data and deep learning techniques.
Deep learning is a subset of machine learning where artificial neural networks, algorithms inspired by the human brain, learn from large amounts of data. Similar to how we learn from experience, a deep learning algorithm performs a task repeatedly, each time tweaking it a little to improve the outcome. We refer to 'deep learning' because the neural networks have many (deep) layers that enable learning.
Deep learning models considered: RNN-LSTM, RNN-GRU and Bidirectional LSTM.
from keras.models import Sequential, Model
from keras.preprocessing import sequence
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Dropout, Flatten, Dense, Embedding, LSTM, GRU
from keras.layers import BatchNormalization, TimeDistributed, Conv1D, MaxPooling1D, SpatialDropout1D
from keras.preprocessing.text import Tokenizer
from keras.layers.merge import Concatenate
# Create embedding matrix
EMBEDDING_FILE = 'glove.6B.200d.txt'
MAX_SEQUENCE_LENGTH = 500
EMBEDDING_DIM=200
MAX_NB_WORDS=400000
# Function to generate Embedding
def loadData_Tokenizer(X_train, X_test, filename):
    np.random.seed(7)
    text = np.concatenate((X_train, X_test), axis=0)
    text = np.array(text)
    tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ', char_level=False)
    tokenizer.fit_on_texts(text)
    sequences = tokenizer.texts_to_sequences(text)
    word_index = tokenizer.word_index
    text = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
    print('Found %s unique tokens.' % len(word_index))
    indices = np.arange(text.shape[0])
    text = text[indices]
    print(text.shape)
    X_train = text[0:len(X_train), ]
    X_test = text[len(X_train):, ]
    embeddings_index = {}
    f = open(filename, encoding="utf8")
    for line in f:
        values = line.split()
        word = values[0]
        try:
            coefs = np.asarray(values[1:], dtype='float32')
        except ValueError:
            continue  # skip malformed lines instead of reusing a stale vector
        embeddings_index[word] = coefs
    f.close()
    print('Total %s word vectors.' % len(embeddings_index))
    return (X_train, X_test, word_index, embeddings_index)
embedding_matrix = []
def buildEmbed_matrices(word_index, embedding_dim):
    embedding_matrix = np.random.random((len(word_index) + 1, embedding_dim))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)  # uses embeddings_index returned by the cell above
        if embedding_vector is not None:
            if len(embedding_matrix[i]) != len(embedding_vector):
                print("Could not broadcast input array from shape", len(embedding_matrix[i]),
                      "into shape", len(embedding_vector),
                      "- please make sure EMBEDDING_DIM matches the GloVe embedding file")
                exit(1)
            embedding_matrix[i] = embedding_vector
        # words not found in the embedding index keep their random initialisation
    return embedding_matrix
The embedding layer has a single weight matrix: a 2D float matrix where each entry i is the word vector associated with index i. We load the GloVe matrix prepared above into the embedding layer, the first layer in the model.
# Generate Glove embedded datasets
X_train_Glove, X_test_Glove, word_index, embeddings_index = loadData_Tokenizer(X_train2,X_test2,EMBEDDING_FILE)
embedding_matrix = buildEmbed_matrices(word_index,EMBEDDING_DIM)
Defining the model:
We use Keras' Sequential() to instantiate a model. It stacks a group of sequential layers together into a single model.
We set the input shape equal to the maximum sequence length, and define the Embedding layer as the first layer of the model.
There are 200 filters in each Conv1D layer with relu activation, and 128 units in the LSTM layer.
There are 100 neurons in the dense layer, with relu activation. This is typically found by experimentation - more neurons per layer can help extract more features, but can also sometimes work against the goal.
Finally, we have a Dense output layer with the softmax activation function.
We measure the loss with the categorical cross-entropy function, use the efficient Adam optimization algorithm to find the weights, and calculate and report the accuracy metric each epoch.
import tensorflow as tf
def Build_Model_RNN_Text(word_index, embeddings_matrix, ytrain, nclasses, dropout=0.5):
    model = Sequential()
    embedding_layer = Embedding(len(word_index) + 1,
                                EMBEDDING_DIM,
                                weights=[embeddings_matrix],
                                input_length=MAX_SEQUENCE_LENGTH,
                                trainable=True)
    model.add(Input(shape=(MAX_SEQUENCE_LENGTH,), dtype=tf.int64))
    model.add(embedding_layer)
    model.add(Conv1D(200, 10, activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Dropout(0.3))
    model.add(Conv1D(200, 10, activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(LSTM(128))
    model.add(Dropout(0.3))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(len((pd.Series(ytrain)).unique()), activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
    print(model.summary())
    return model
model_RNN = Build_Model_RNN_Text(word_index,embedding_matrix,y_train2,17)
log_cols2=["Classifier","Training accuracy","Testing accuracy","f1-score","Recall"]
log2 = pd.DataFrame(columns=log_cols2)
# Adding callbacks
from keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=4)
mc = ModelCheckpoint('rnn_model.h5', verbose=1, monitor='val_accuracy', save_best_only=True, mode='auto')
rl = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=2, min_lr=0.0001)
logTmp2 = []
cbArr = [es, mc, rl]
logTmp2, y_pred_RNNmodel = run_classification(model_RNN, X_train_Glove, X_test_Glove, y_train2, y_test2, cbArr, pipelineRequired = False,isDeepModel=True)
logTmp2['Classifier'] = "RNN-LSTM Model - without Sampling"
log2 = log2.append(logTmp2)
log2
The model can be defined as follows:
Here we use the Keras functional API: an Input layer is followed by the Embedding layer, which is loaded with the GloVe weights and whose input length is set to the maximum sequence length.
We add a GRU layer with 128 units.
There are 100 neurons in the dense layer, with relu activation. This is typically found by experimentation - more neurons per layer can help extract more features, but can also sometimes work against the goal.
Finally, we have a Dense output layer with the softmax activation function.
We measure the loss with the categorical cross-entropy function, use the efficient Adam optimization algorithm to find the weights, and calculate and report the accuracy metric each epoch.
# Build GRU model
def Build_Model_RNNGRU_Text(word_index, embeddings_matrix, ytrain, nclasses, dropout=0.5):
    input_layer = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype=tf.int64)
    embed = Embedding(len(word_index) + 1, EMBEDDING_DIM, weights=[embeddings_matrix], input_length=MAX_SEQUENCE_LENGTH, trainable=True)(input_layer)
    gru = GRU(128)(embed)
    drop = Dropout(0.3)(gru)
    dense = Dense(100, activation='relu')(drop)
    out = Dense(len((pd.Series(ytrain)).unique()), activation='softmax')(dense)
    model = Model(input_layer, out)
    # Compile the model
    model.compile(loss='sparse_categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
    # Print model summary
    print(model.summary())
    return model
# Adding callbacks
es = EarlyStopping(monitor = 'val_loss', mode = 'min', verbose = 1, patience = 4)
mc = ModelCheckpoint('rnnGru_model.h5', monitor = 'val_loss', mode = 'min', save_best_only = True, verbose = 1)
rl = ReduceLROnPlateau(monitor='val_loss', factor=0.2,patience=2, min_lr=0.0001)
model_RNNGRU = Build_Model_RNNGRU_Text(word_index,embedding_matrix,y_train2,17)
logTmp2 = []
cbArr = [es, mc, rl]
logTmp2, y_pred_RNNGRUmodel = run_classification(model_RNNGRU, X_train_Glove, X_test_Glove, y_train2, y_test2, cbArr, pipelineRequired = False,isDeepModel=True)
logTmp2['Classifier'] = "RNN-GRU Model - without Sampling"
log2 = log2.append(logTmp2)
log2
len(word_index) + 1
The model can be defined as follows:
A bidirectional LSTM creates two copies of the recurrent layer: one fits the input sequence as-is and one a reversed copy of it. By default, the output values from these two LSTMs are concatenated.
We define the Embedding layer as the first layer after the input.
We add a bidirectional LSTM layer with 128 units.
There are 100 neurons in the dense layer, with relu activation. This is typically found by experimentation - more neurons per layer can help extract more features, but can also sometimes work against the goal.
Finally, we have a Dense output layer with the softmax activation function.
We measure the loss with the categorical cross-entropy function, use the efficient Adam optimization algorithm to find the weights, and calculate and report the accuracy metric each epoch.
from keras.layers import Bidirectional
def Build_Model_BiDirLSTM_Text(word_index, embeddings_matrix, ytrain):
    vocab_size = len(word_index) + 1
    input_layer = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype=tf.int64)
    embed = Embedding(input_dim=vocab_size, output_dim=200, weights=[embeddings_matrix], input_length=MAX_SEQUENCE_LENGTH, trainable=True)(input_layer)
    lstm = Bidirectional(LSTM(128))(embed)
    drop = Dropout(0.3)(lstm)
    dense = Dense(100, activation='relu')(drop)
    out = Dense(len((pd.Series(ytrain)).unique()), activation='softmax')(dense)
    model = Model(input_layer, out)
    # Compile the model
    model.compile(loss='sparse_categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
    # Print model summary
    print(model.summary())
    return model
# Adding callbacks
es = EarlyStopping(monitor = 'val_loss', mode = 'min', verbose = 1, patience = 4)
mc = ModelCheckpoint('bidirlstm_model.h5', verbose=1, monitor='val_accuracy',save_best_only=True, mode='auto')
rl = ReduceLROnPlateau(monitor='val_loss', factor=0.2,patience=2, min_lr=0.0001)
model_BiDirLSTM = Build_Model_BiDirLSTM_Text(word_index,embedding_matrix,y_train2)
logTmp2 = []
cbArr = [es, mc, rl]
logTmp2, y_pred_BiDirLSTM = run_classification(model_BiDirLSTM, X_train_Glove, X_test_Glove, y_train2, y_test2, cbArr, pipelineRequired = False,isDeepModel=True)
logTmp2['Classifier'] = "BiDirectional LSTM Model - without Sampling"
log2 = log2.append(logTmp2)
log2
logObs2 = log2
logObs2.set_index(["Classifier"],inplace=True)
logObs2.sort_values(by=['f1-score'])
logObs2.sort_values(by=['f1-score']).plot(kind='barh',figsize=[15,10])
The difference between training accuracy and testing accuracy is not as high as with the traditional models, and the testing accuracies of these deep learning models look promising, with scope for much improvement.
However, even without much tuning we already observe some overfitting.
The low accuracy is suspected to be due to the imbalanced dataset used for training and testing.
We have to work on resampling the data to make the models perform better.
From the results above, the RNN-LSTM model appears to have the most room to be tuned without overfitting.
We need to explore ways to improve and fine-tune model performance without overfitting.
Data Imbalance Rationalisation: the dataset will be resampled based on multiple approaches:
a. Creating a separate single target group for poorly represented groups (for example, those with 20 or fewer assigned tickets) and then classifying it against the highly represented groups.
b. Then reclassifying this group (the cluster of sparse groups) into the original groups.
c. Possibly dropping some groups entirely that are very sparsely represented (for example, those with fewer than 5 observations).
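Approach (a) can be sketched as follows; this is a minimal illustration on a toy frame, and the threshold of 20 tickets and the bucket name `GRP_OTHERS` are assumptions for the example:

```python
import pandas as pd

# Toy stand-in for ticket_df with an imbalanced 'Assignment group' column
toy = pd.DataFrame({"Assignment group": ["GRP_0"] * 30 + ["GRP_1"] * 25 + ["GRP_9"] * 3})

THRESHOLD = 20  # groups with <= 20 tickets are folded into a single bucket
counts = toy["Assignment group"].value_counts()
rare = counts[counts <= THRESHOLD].index

# Replace sparsely represented groups with one catch-all target class
toy["grouped_target"] = toy["Assignment group"].where(
    ~toy["Assignment group"].isin(rare), "GRP_OTHERS"
)
print(toy["grouped_target"].value_counts().to_dict())
```

A first-stage classifier can then be trained on `grouped_target`, with the `GRP_OTHERS` bucket routed to a second-stage model as described in approach (b).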